Optimizing Multi-Target Detection in Stochastic Environments with Active Smart Camera Networks

Active Smart Camera Networks (SCNs) have a wide range of applications in areas such as banks, airports and underground terminals, where it is necessary to monitor open plan spaces reliably and robustly. However, the monitoring efficiency in such environments is affected by various factors which are stochastic in nature, such as the target movement and arrival times, as well as the capabilities of the on-board camera detection module. This paper presents a modelling framework and a subsequent optimization-based solution for the optimal reconfiguration of the camera network in order to minimize the expected number of undetected targets under uncertainties. The proposed solution is applicable to a variety of scenarios and does not require a full view of the area. Simulation results indicate that the proposed solution is robust to varying conditions and is able to achieve good monitoring performance with only a few cameras.


INTRODUCTION
Smart Camera Networks (SCNs) of actively controlled pan-tilt cameras with advanced sensing and processing capabilities are progressively being deployed for the monitoring of public and private open plan spaces such as airport and underground terminals, and government and military premises [12], [2]. As the size and complexity of camera networks increase, there is a need for improved robustness, reliability, scalability, and adaptability, and for reduced human intervention, especially with regards to actively controlled cameras (e.g., with pan-tilt capabilities) [7]. One of the most important challenges towards achieving these goals relates to the stochastic nature of the monitored environment, such as the arrival times and trajectories of targets of interest. In addition, there are also uncertainties in the camera object detection modules which, if not considered, may result in undetected targets, especially when considering low-cost embedded smart cameras with limited processing capabilities. Hence, it is necessary to develop intelligent algorithms that actively control the cameras to collaboratively and autonomously monitor the whole area and detect as many targets as possible under uncertainties. Furthermore, it is often the case that the cameras cannot offer adequate coverage of the whole monitored area at once.
Hence, it is also important to develop exploration methods for the observation of new targets that enter the monitored area.
This work presents a novel modelling framework that considers the stochastic nature of target movement and arrival times to represent the expected number of targets over space in the monitored area. Using this framework, a novel optimization-based approach is further developed to obtain the configuration of each camera (i.e., pan and tilt angles) that minimizes the miss-detection probability of each target over time, hence maximizing the expected number of targets detected at least once. The proposed solution leverages a probabilistic model, which we proposed in [1], that captures the detection performance of cameras as a function of distance. The modelling framework and corresponding solution approach are evaluated through simulations based on specifications derived from a Raspberry-Pi-based pan-tilt camera, and are compared to the ideal case where the target positions are a-priori known, demonstrating comparable detection performance and better execution time.
The rest of this paper is structured as follows. Section 2 outlines some key areas of emerging research related to this work. In Section 3 we formally introduce the problem as well as assumptions for the visual sensors, targets, and the proposed framework. In Section 4 we formulate an optimization algorithm that utilizes detection probability information in order to identify new camera configurations that maximize the overall target detection probability. In Section 5 we present the evaluation results for the proposed framework through simulations. Finally, Section 6 provides concluding remarks and directions for future work.

RELATED WORK
A large body of research in SCNs relevant to our work concerns area and target coverage [7]. In the former case the main objective is to fully cover an observed area and monitor the activity within it. In the latter case the main objective is to cover particular points of interest (PoI) that correspond to targets with known positions [9], [4], [13]. In such works the cameras are assumed to be able to monitor a spherical sector and change their horizontal orientation to observe different areas by selecting a pan angle from a discrete set. In this setting a target is either covered or not covered by a camera; hence, the notion of the camera detecting and extracting target information is not present. To the best of our knowledge, there is no relevant study that addresses the same problem of maximizing the detection performance of SCNs for moving targets under uncertainty. However, to highlight our contributions we provide an overview of related works and discuss the main differences with our approach.
In [6] a particle swarm optimization technique is used to optimize the coverage both locally for targets and globally for the area. The local coverage is based on a quality-of-view metric characterized as a Gaussian distribution that depends on the distance of a target from the camera, while only zoom and pan are controllable parameters. We go beyond this work by accounting for the actual detection performance of the cameras, which can vary for different targets and distances, and by considering moving rather than static targets. In addition, we identify activity areas where new targets may appear and hence do not need to constantly view the whole area. The work in [8] addresses the problem of covering each target by a minimum number of cameras in a balanced way, so as to ensure adequate monitoring in the presence of faults or blocked line-of-sight; however, it does not examine detection aspects. In our case we impose no limits on the number of cameras, as we use the actual detection performance of the available cameras (which can also encapsulate the presence of occlusions). In [9] a Centralized Force-Directed Algorithm is proposed to maximize the number of targets covered. This work solves a basic instance of the problem with static targets, so the movement and discovery of new targets are not considered. A study on full-view coverage, where the targets need to be viewed with the least number of cameras but from different viewpoints, is presented in [5]. Even though this is a very important problem, the work considers only cameras which can change their pan orientation and does not investigate the computer vision parameters. Natarajan et al. [10] tackle the problem of target tracking by a hybrid network of active and static cameras through a Markov Decision Process. In contrast, no static cameras are used to provide target positions in our work.
Instead, we exploit the stochastic nature of target generation and motion to build an activity map [11] within the area, and then use the probabilistic camera model to maximize the detection performance over time as new targets enter the area.

MODELLING
In this paper we investigate how to maximize the detection of multiple targets in the area using only active cameras, without any additional sensors or static cameras to provide their location. To tackle this problem we model various aspects of the environment such as entrance/exit points, target mobility and arrival rates, the area under surveillance, and the camera sensing and detection algorithms. We utilize the modelling framework to develop an exploration algorithm and a suitable optimization algorithm that together can direct the active cameras to monitor locations where undetected targets are most probable to appear.

Entrance/Exit Points and Paths
We assume that the monitored area has a total of N_E entrance/exit points, each located at a known position in the area. A target can enter from point r1 and exit from point r2, r1 ≠ r2; such a pair of points defines path p_l, with P(l) = {r1, r2}. The number of targets that arrive at point r1 and subsequently exit from point r2 during an interval of duration τ follows a Poisson distribution P(μ) with parameter μ = λ_l × τ, where λ_l is the mean number of arrivals per time step for path l.
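As an illustration, the arrival process above can be sampled directly. The rates and interval length below are hypothetical values chosen for the sketch, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical mean arrivals per time step (lambda_l) for three paths.
lam = np.array([0.05, 0.10, 0.02])
tau = 100.0  # interval length, in time steps

# Arrivals per path over the interval follow Poisson(mu) with mu = lambda_l * tau.
arrivals = rng.poisson(lam * tau)
print(arrivals)
```

Each entry of `arrivals` is one realization of the number of targets using the corresponding path during the interval.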

Targets
Each target j ∈ T that enters the area moves with a constant speed uj generated from a Gaussian distribution with mean μu and variance σu², uj ∼ Gaussian(μu, σu²). In our modelling framework we assume that each target moves along the shortest path between its entrance and exit point, which, when there are no obstacles, is the straight line formed by the points {r1, r2} and is characterized by an inclination angle θl. However, the true target trajectory can deviate from this straight line. Each target is characterized by a position vector Xj = (x_j^T, y_j^T) in the field and has a distance dij from each camera i. It is assumed that a target enters the area at time t_j^a and departs at time t_j^d; hence, the total time that the target is present within the field is tj = t_j^d − t_j^a. We assume that target j traverses the field according to the dynamic discrete-time linear model

Xj(t + Δt) = Xj(t) + uj Δt [cos θl, sin θl]^T + ωj(t),

where ωj(t) ∈ R^(2×1), ωj(t) ∼ N(0, σj²), is the motion noise vector and Δt is the time step of the discrete-time model.
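The target model above can be sketched as a short simulation. All numeric parameters (mean speed, noise levels, time step) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def simulate_target(entry, exit_pt, mu_u=1.2, sigma_u=0.2, sigma_w=0.05, dt=1.0, rng=None):
    """Simulate one target walking the straight path entry -> exit_pt
    with a per-target constant speed and additive Gaussian motion noise."""
    rng = rng or np.random.default_rng()
    entry = np.asarray(entry, float)
    exit_pt = np.asarray(exit_pt, float)
    d = exit_pt - entry
    theta = np.arctan2(d[1], d[0])               # inclination angle theta_l of the path
    u = max(rng.normal(mu_u, sigma_u), 0.1)      # per-target speed u_j (kept positive)
    step = u * dt * np.array([np.cos(theta), np.sin(theta)])
    x, traj = entry.copy(), [entry.copy()]
    while np.linalg.norm(exit_pt - x) > u * dt:  # stop within one step of the exit
        # X_j(t + dt) = X_j(t) + u_j*dt*[cos(theta_l), sin(theta_l)]^T + omega_j(t)
        x = x + step + rng.normal(0.0, sigma_w, size=2)
        traj.append(x.copy())
    return np.array(traj)
```

The noise term makes the realized trajectory deviate from the straight nominal path, as the model allows.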

Camera and Sensing Model
We consider a set of active camera nodes i ∈ C with two degrees of freedom (DoF), the pan and tilt angles. The position of each camera i in the area is denoted by (x_i, y_i), whereas its configuration and the possible FoVs that it can view are determined by the tuple (θ_i^P, θ_i^T, θ_i^V, θ_i^H, H_i), which are the pan and tilt angles, the vertical and horizontal viewing angles, and the height at which it is located. We assume that H_i, θ_i^V, and θ_i^H are fixed, so that the local FoV f_i is determined using only θ_i^P and θ_i^T. The whole area that a camera can monitor with its full range of angles is F_i. The angles are adjusted by the corresponding motorized pan-tilt stage on which the camera is mounted and are bounded within the mechanical limits of the stage, θ_i^P ∈ [θ_min^P, θ_max^P] and θ_i^T ∈ [θ_min^T, θ_max^T]. By projecting the 3D visual sensing model of camera i onto the 2D field, the camera FoV can be approximated by a trapezoid, as shown in Fig. 1, determined by (θ_i^P, θ_i^T). Each pair of angles defines a different configuration k from all possible configurations K_it that camera i can have at time t, which can change over time depending on the activity in the area.

[Figure 1: Sensing and detection model for downwards-looking cameras.]

Sensing Model
The sensing model, first introduced in [1], attempts to capture how different parameters, such as the proximity of a point of interest to the cameras, the noise, the viewpoint of each camera, and the probabilistic nature of the underlying machine learning algorithms, impact the performance of object detection algorithms and subsequently the monitoring efficiency. In this description we assume that the point of interest is the actual target, but the approach is also applicable when we want to monitor locations of interest for potential targets, as discussed in Sec. 3.4. The model can be summarized as follows. The total area F_i that can be monitored by camera i is segmented into N_Z zones Z_im, where m = 1, ..., N_Z, with zone N_Z being the one located farthest from the camera. Each zone m is the locus of points within the global FoV of camera i with distance between D_i(m−1)^Z and D_im^Z from the camera's origin, as shown in Fig. 1. For each configuration that a camera can have, it views a subset of the m zones which belong to its local FoV f_i(θ_i^T, θ_i^P). Each zone is characterized by different detection probabilities that depend on the sensitivity of the machine learning algorithm to the visibility of features, illumination changes, and noise. However, it is assumed that each zone is characterized by a single constant detection probability which is the average of the different probabilities within that zone. Hence, when a target is in zone Z_im of camera i, it is assumed that on average it is detected with probability P_im^Z. Note that the detection probability P_im^Z of zone m is not given by a specific equation but is inherent to the probabilistic nature of the machine learning algorithm used for object detection. A camera can determine the zone Z_im in which a target j is detected through trigonometry, using the pan and tilt angles (θ_i^P, θ_i^T) and height H_i.
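A minimal sketch of the zone lookup follows. The zone edges and probabilities are assumptions chosen to resemble the 5-zone, 3-9 m model used later in the evaluation; they are not calibrated values:

```python
import numpy as np

# Illustrative 5-zone model: zone m covers ground distances (EDGES[m-1], EDGES[m]]
# from the camera, each with a single average detection probability P^Z_im.
ZONE_EDGES = np.array([3.0, 4.2, 5.4, 6.6, 7.8, 9.0])   # D^Z_i0 ... D^Z_i5, in metres
ZONE_PROBS = np.array([0.2, 0.5, 0.8, 0.5, 0.2])        # P^Z_i1 ... P^Z_i5

def detection_prob(target_xy, camera_xy):
    """Average detection probability for a target, from the zone its
    camera-to-target ground distance falls in (0 outside all zones)."""
    d = float(np.linalg.norm(np.asarray(target_xy, float) - np.asarray(camera_xy, float)))
    m = int(np.searchsorted(ZONE_EDGES, d))   # zone index: EDGES[m-1] < d <= EDGES[m]
    if m == 0 or m > len(ZONE_PROBS):
        return 0.0                            # closer than zone 1 or beyond zone N_Z
    return float(ZONE_PROBS[m - 1])
```

In practice the distance would be derived trigonometrically from the pan/tilt angles and camera height rather than from known target coordinates.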
In each time step t, camera i monitors targets located in its FoV based on the aforementioned model. We can evaluate how well a camera performs as follows. The FoV corresponding to its current configuration k ∈ K_it contains a set of targets S_ikt, where each target j in the set is detected with non-zero detection probability P_ijkt. It follows that the miss-detection probability of target j from camera i using configuration k can be defined as Q_ijkt = 1 − P_ijkt. When a target j is observed by multiple cameras, the overall probability P_j^o with which j is detected at least once over the entire time horizon H is given by:

P_j^o = 1 − ∏(t ∈ H) ∏(i ∈ C) Q_ijkt.
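The combination rule above is a straightforward product of miss-detection probabilities; a minimal sketch:

```python
import numpy as np

def overall_detection_prob(miss_probs):
    """P^o_j = 1 - product of the miss-detection probabilities Q_ijkt accumulated
    over every (camera i, configuration k, time step t) that views target j."""
    return 1.0 - float(np.prod(np.asarray(miss_probs, float)))

# Example: two cameras each view a target for two time steps, with per-step
# detection probabilities 0.5 and 0.8 (hence Q values 0.5 and 0.2).
p = overall_detection_prob([0.5, 0.2, 0.5, 0.2])   # p is approximately 0.99
```

Note that even modest per-step detection probabilities compound quickly over a few observations, which is what the optimization exploits.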

Camera Configurations
Given a set of PoI (e.g., targets or locations in an area), we are interested in finding the best possible configuration k for each camera i that meets the given constraints for detection performance. To do so, all possible configurations K_it for a camera, with the corresponding detection probabilities for each point, must first be found. To avoid a brute-force approach, a systematic procedure is performed that generates a finite number of configurations. This procedure is executed by each camera and takes as input the position coordinates of the points to be monitored. Its output is a list of possible configurations for each camera with the respective miss-detection probabilities for each point. This list is used as input to the optimization algorithm in order to find the overall best configuration for each camera that minimizes the miss-detection probabilities. If the PoI are moving, the set of possible configurations K_it changes over time; however, when the points are static it is only necessary to compute it once. A camera determines the corresponding detection probabilities by forming different FoVs around anchor points, each of which determines a unique configuration and a corresponding FoV. This procedure is repeated for each target/point in the area to generate the configurations for a specific camera. The reader is referred to [3] for more details.
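The aiming geometry underlying this procedure can be sketched as follows. This is a simplification: the full procedure in [3] forms several candidate FoVs around each anchor point, while here we only compute the single (pan, tilt) pair that points the optical axis at each point:

```python
import numpy as np

def aim_at_points(cam_xy, cam_height, points):
    """For each point of interest, compute the (pan, tilt) pair that centers
    the camera's optical axis on it; one candidate configuration per point."""
    cam_xy = np.asarray(cam_xy, float)
    configs = []
    for p in points:
        d = np.asarray(p, float) - cam_xy
        pan = np.arctan2(d[1], d[0])                      # theta^P_i: bearing on the ground plane
        tilt = np.arctan2(np.linalg.norm(d), cam_height)  # theta^T_i: angle down from the vertical
        configs.append((pan, tilt))
    return configs
```

In a real system each candidate pair would then be clipped to the mechanical pan/tilt limits and mapped to the zones its FoV covers.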

Area and Path Segments
In order to model the expected number of undetected targets in the area, we first segment the monitored area into a grid of cells g ∈ G = {1, ..., N_g}, where N_g is the number of cells and cell g is centered at (x_g^G, y_g^G). Furthermore, given that a target moves with mean speed u_j, we can estimate its position along the path at future time steps. Hence, we divide each path l into a total of N_l^s segments, such that each cell contains a number of segments for a particular path. We denote the index of the cell in which segment s of path l is located as g_l(s). We assume, without loss of generality, that the miss-detection probability of a target is the same at every point within a cell.

Algorithm 1: Estimation of area activity and cell miss-detection performance
 1: At t = 0 initialize E_ls0 = 0, ∀ l, s
    (1) Propagate the expected number of targets
 2: for l = 1 : N_l do
 3:   for s = 2 : N_l^s do
 4:     E_lst = E_l(s−1)t
 5:   end for
 6:   E_l1t = λ_l
 7: end for
    (2) Update cell weights
 8: for l = 1 : N_l do
 9:   for s = 1 : N_l^s do
10:     W_gl(s)t = W_gl(s)t + E_lst
11:   end for
12: end for
    (3) Update cell miss-detection probabilities
13: for g ∈ G do
14:   update the overall miss-detection probability Q_gt of cell g from the new camera configurations
15: end for
    (4) Update segment miss-detection probabilities
16: for l = 1 : N_l do
17:   for s = 1 : N_l^s do
18:     Q_lst = Q_gl(s)t
19:   end for
20: end for
    (5) Update the expected number of undetected targets
21: for l = 1 : N_l do
22:   for s = 1 : N_l^s do
23:     E_lst = E_lst × Q_lst
24:   end for
25: end for

OPTIMIZATION AND RECONFIGURATION ALGORITHM
We utilize the modelling framework of Section 3 to develop a solution to the problem of maximizing the number of targets detected at least once. The solution relies on predicting the likelihood of undetected targets occurring in a specific region of the area, which depends on the paths followed by the targets, the target arrival rates, and the resulting camera detection performance. Through appropriate optimization, the cameras can then be reconfigured to increase the prospect of detecting those targets.

Area Activity Estimation
The expected activity within a cell is determined by the activity propagating from all paths that go through that cell. We model this via a data vector for each path that maintains the parameter E_lst, representing the expected number of undetected targets along segment s of path l at time t. This structure can be thought of as a First-In-First-Out (FIFO) queue, with length equal to the number of segments in the path, N_l^s, that propagates the expected number of undetected targets per time step along each path. By aggregating the contributions of the different E_lst values associated with cell g we can find a weight for that cell, W_gt, which represents the expected number of undetected targets. Combining this with the miss-detection probabilities at those cells, we can quantify the potential for the detection of new targets within different cells, which is used to decide on the best reconfiguration of the cameras.
Algorithm 1 shows how to update the expected number of undetected targets in each path segment, E_lst, and each cell weight, W_gt. At time 0, E_lst is initialized to zero (line 1). The first step is to simulate the propagation of undetected targets within the area by shifting the values of E_lst, as shown in line 4. At the same time, we initialize the first segment of each path with the arrival rate per time step for that path, λ_l, as shown in line 6. The second step is to update the cell weights by summing the contributions of each cell segment, as shown in line 10, to give the total number of estimated undetected targets per cell. After this step, the optimization algorithm outlined in Sec. 4.2 is executed to reconfigure the cameras with the new cell weights. The new configurations result in updated miss-detection probabilities for each cell, which are computed in line 14. Accordingly, in line 18, we also update the per-segment miss-detection probabilities, which correspond to the cell g_l(s) that segment s belongs to. The final step is to update the expected number of undetected targets per segment in line 23. This is done by multiplying the expected number of targets in that cell by the cell's overall miss-detection probability, resulting in a new expected number of undetected targets.
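One full update step of Algorithm 1 can be sketched as below. The data layout (one array per path, integer cell indices per segment) is a hypothetical choice for this sketch, and the cell miss-detection probabilities `Q_cell` are taken as input, standing in for the output of the reconfiguration optimization that runs between the weight and thinning steps:

```python
import numpy as np

def activity_update(E, lam, seg_cell, Q_cell):
    """One time step of Algorithm 1 (sketch).
    E        : list of 1-D arrays; E[l][s] = expected undetected targets in segment s of path l
    lam      : lam[l] = mean arrivals per time step on path l
    seg_cell : list of int arrays; seg_cell[l][s] = cell index g_l(s) of segment s
    Q_cell   : Q_cell[g] = overall miss-detection probability of cell g after reconfiguration
    Returns the updated E and the cell weights W fed to the optimization."""
    W = np.zeros(len(Q_cell))
    for l in range(len(E)):
        # (1) propagate: shift the FIFO queue and feed new arrivals into segment 1
        E[l] = np.concatenate(([lam[l]], E[l][:-1]))
        # (2) aggregate expected undetected targets into per-cell weights
        np.add.at(W, seg_cell[l], E[l])
    for l in range(len(E)):
        # (3)-(5) thin each segment by its cell's miss-detection probability:
        # only the still-undetected fraction survives to the next time step
        E[l] = E[l] * Q_cell[seg_cell[l]]
    return E, W
```

Running this repeatedly shows expected undetected mass flowing along each path and decaying wherever the cameras cover the corresponding cells well.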

Optimization problem
In order to maximize the expected number of detected targets we want to direct the cameras to monitor cells with a high probability of containing undetected targets which is expressed by Wgt. This problem is equivalent to minimizing the weighted overall miss-detection probability for each cell. Recall from Section 3 that the overall miss-detection probability for a point of interest is equal to the product of miss-detection probability of all cameras that cover it. Hence, we formulate the following optimization problem to find a solution.
Notice that the objective of (3) is nonlinear and cannot be solved directly with standard solvers. Nonetheless, we are able to transform the problem into a mixed-integer linear program, which can be solved with standard optimization software, using an appropriate problem reformulation and a subsequent piecewise-linear approximation. More details can be found in [3], where the approach was first introduced for the case where W_gt = 1, ∀ g, t.
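For intuition, the objective can be evaluated by brute force on tiny instances. This enumeration is not the paper's method (which solves a piecewise-linear MILP reformulation [3]) and scales exponentially in the number of cameras; the data layout is a hypothetical choice for the sketch:

```python
import itertools
import numpy as np

def best_joint_configuration(W, Q):
    """Exhaustive baseline for tiny instances: choose one configuration k_i per
    camera i to minimize sum_g W_g * prod_i Q[i][k_i][g], the weighted overall
    per-cell miss-detection probability. Q[i] is a (configurations x cells)
    array of per-cell miss-detection probabilities for camera i, with 1.0
    wherever a configuration does not cover a cell."""
    W = np.asarray(W, float)
    best_cost, best_choice = np.inf, None
    for choice in itertools.product(*(range(len(Qi)) for Qi in Q)):
        q_overall = np.prod([Q[i][k] for i, k in enumerate(choice)], axis=0)
        cost = float(W @ q_overall)
        if cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_choice, best_cost
```

Because the overall miss-detection probability is a product over cameras, the enumeration naturally rewards pointing different cameras at the same heavily weighted cell only when the marginal reduction outweighs covering other cells, which is the trade-off the MILP captures exactly.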

EVALUATION RESULTS
To evaluate the proposed framework we developed a simulation environment in Matlab which incorporates all the modelling aspects introduced in Section 3. We simulate a square 20 × 20 m area in which 12 entrance/exit points are generated. Four cameras, one at the center of each side, are employed to monitor the area. Different paths are formed between the points, and targets enter from each point at different rates. Each camera runs a pedestrian detection module for which we use the model established in [3]: a 5-zone model with probability values (0.2, 0.5, 0.8, 0.5, 0.2), resulting in a bell-shaped probability distribution with an effective capturing distance of 3 to 9 metres for each camera.
We run a total of 200 different simulation scenarios and record the number of targets missed, as well as how performance is affected by various parameters such as the noise in the target positions, the arrival rate, the number of paths, and the target velocity. In addition, since there are no related works that tackle the same problem, we compare the performance of our approach to an ideal-case scenario, labeled as oracle, where the target positions are known and the cameras reconfigure through optimization to improve the detection performance. Since the oracle knows all target locations and does not have to discover new targets as in our case, it is expected to perform better. However, it serves as a benchmark which indicates how close the developed approach is to the optimal solution.
The results are illustrated in Figs. 2-6. First, Fig. 2 shows the percentage of missed targets with respect to noise on the target positions, or in other words the deviation of their positions from their intended paths. As expected, the oracle performs better, with a value of around 3% compared to 6-9% for the proposed algorithm. Notice, however, that our approach maintains steady performance even when the target trajectories deviate significantly from a straight line. Fig. 3 indicates that as the velocity of the targets increases, the performance of our approach deteriorates; this is because a higher target speed reduces the number of time steps during which a camera views the target, leaving fewer chances to detect a new target. This can be improved by adding more cameras to increase the chances of detection. On the contrary, there is only a slight deterioration in the performance of the oracle algorithm, as the target positions are always known. In Fig. 4 we can see that as the rate of target arrivals increases, the performance of the algorithm remains steady at 5-6%. This is because with the proposed approach the cameras keep covering important hot-spots in the area and thus keep up with the increasing arrival rate. On the other hand, the oracle performance deteriorates in this case, as the algorithm has to choose between covering different targets. Finally, in Fig. 5 we observe that as the number of paths in the area increases, the performance of the oracle deteriorates, while the performance of the proposed approach remains steady at 5-6%. This is again because in the oracle case the cameras have to choose between targets, while in our case we monitor areas with a high probability of undetected targets passing through. We also performed experiments on the execution time of the optimization for the two approaches with respect to the number of paths. As illustrated in Fig. 6, the execution time of the proposed approach is almost constant at 0.05 s, since it depends on the number of cells, which is constant, while the oracle execution time increases as more targets need to be considered in the optimization.

CONCLUDING REMARKS
This paper introduced a framework that considers stochastic parameters of the environment, such as the camera detection performance and target movement, to dynamically adjust the pan and tilt angles of active cameras to cover different regions, thus maximizing the expected number of unique targets detected over a time horizon. Overall, one of the main conclusions from the simulation results is that incorporating known statistical information, such as entry rates and path trajectories, can greatly improve the monitoring efficiency of SCNs. The proposed approach manages to perform close to the optimal solution while remaining computationally independent of the number of targets, and without having to keep a view of the whole area. Future research will utilize the actual camera detections, rather than theoretical values, as feedback to improve the estimation of the locations of undetected targets. Within our immediate plans is also the modelling of more complex scenarios (target trajectories, no knowledge of arrival rates, etc.). Another direction of work is to experimentally evaluate the system under real-world conditions to gain more insight into the overall performance and implementation challenges.

ACKNOWLEDGMENTS
This work was supported in part by the ERC Advanced Grant "Fault-Adaptive", ERC grant agreement No 291508. The dissemination of this work is being supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 739551 (KIOS CoE).